{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Simple RNN Encode-Decoder for Translation\n",
    "\n",
    "**Learning Objectives**\n",
    "1. Learn how to create a tf.data.Dataset for seq2seq problems\n",
    "1. Learn how to train an encoder-decoder model in Keras\n",
    "1. Learn how to save the encoder and the decoder as separate models \n",
    "1. Learn how to piece together the trained encoder and decoder into a translation function\n",
    "1. Learn how to use the BLUE score to evaluate a translation model\n",
    "\n",
    "## Introduction\n",
    "\n",
    "In this lab we'll build a translation model from Spanish to English using a RNN encoder-decoder model architecture.\n",
    "We will start by creating train and eval datasets (using the `tf.data.Dataset` API) that are typical for seq2seq problems. Then we will use the Keras functional API to train an RNN encoder-decoder model, which will save as two separate models, the encoder and decoder model. Using these two separate pieces we will implement the translation function.\n",
    "At last, we'll benchmark our results using the industry standard BLEU score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pip freeze | grep nltk || pip install nltk"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pickle\n",
    "import sys\n",
    "\n",
    "import nltk\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "import tensorflow as tf\n",
    "from tensorflow.keras.layers import (\n",
    "    Dense,\n",
    "    Embedding,\n",
    "    GRU,\n",
    "    Input,\n",
    ")\n",
    "from tensorflow.keras.models import (\n",
    "    load_model,\n",
    "    Model,\n",
    ")\n",
    "\n",
    "import utils_preproc\n",
    "\n",
    "print(tf.__version__)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "SEED = 0\n",
    "MODEL_PATH = 'translate_models/baseline'\n",
    "DATA_URL = 'http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip'\n",
    "LOAD_CHECKPOINT = False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tf.random.set_seed(SEED)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Downloading the Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll use a language dataset provided by http://www.manythings.org/anki/. The dataset contains Spanish-English  translation pairs in the format:\n",
    "\n",
    "```\n",
    "May I borrow this book?\t¿Puedo tomar prestado este libro?\n",
    "```\n",
    "\n",
    "The dataset is a curated list of 120K translation pairs from http://tatoeba.org/, a platform for community contributed translations by native speakers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "path_to_zip = tf.keras.utils.get_file(\n",
    "    'spa-eng.zip', origin=DATA_URL, extract=True)\n",
    "\n",
    "path_to_file = os.path.join(\n",
    "    os.path.dirname(path_to_zip),\n",
    "    \"spa-eng/spa.txt\"\n",
    ")\n",
    "print(\"Translation data stored at:\", path_to_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv(\n",
    "    path_to_file, sep='\\t', header=None, names=['english', 'spanish'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.sample(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the `utils_preproc` package we have written for you,\n",
    "we will use the following functions to pre-process our dataset of sentence pairs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sentence Preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `utils_preproc.preprocess_sentence()` method does the following:\n",
    "1. Converts sentence to lower case\n",
    "2. Adds a space between punctuation and words\n",
    "3. Replaces tokens that aren't a-z or punctuation with space\n",
    "4. Adds `<start>` and `<end>` tokens\n",
    "\n",
    "For example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw = [\n",
    "    \"No estamos comiendo.\",\n",
    "    \"Está llegando el invierno.\",\n",
    "    \"El invierno se acerca.\",\n",
    "    \"Tom no comio nada.\",\n",
    "    \"Su pierna mala le impidió ganar la carrera.\",\n",
    "    \"Su respuesta es erronea.\",\n",
    "    \"¿Qué tal si damos un paseo después del almuerzo?\"\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "processed = [utils_preproc.preprocess_sentence(s) for s in raw]\n",
    "processed"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sentence Integerizing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `utils_preproc.tokenize()` method does the following:\n",
    "    \n",
    "1. Splits each sentence into a token list\n",
    "1. Maps each token to an integer\n",
    "1. Pads to length of longest sentence \n",
    "\n",
    "It returns an instance of a [Keras Tokenizer](https://keras.io/preprocessing/text/)\n",
    "containing the token-integer mapping along with the integerized sentences:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "integerized, tokenizer = utils_preproc.tokenize(processed)\n",
    "integerized"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The outputted tokenizer can be used to get back the actual works\n",
    "from the integers representing them:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tokenizer.sequences_to_texts(integerized)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating the tf.data.Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `load_and_preprocess`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 1\n",
    "\n",
    "Implement a function that will read the raw sentence-pair file\n",
    "and preprocess the sentences with `utils_preproc.preprocess_sentence`.\n",
    "\n",
    "The `load_and_preprocess` function takes as input\n",
    "- the path where the sentence-pair file is located\n",
    "- the number of examples one wants to read in\n",
    "\n",
    "It returns a tuple whose first component contains the english\n",
    "preprocessed sentences, while the second component contains the\n",
    "spanish ones:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_and_preprocess(path, num_examples):\n",
    "    with open(path_to_file, 'r') as fp:\n",
    "        lines = fp.read().strip().split('\\n')\n",
    "\n",
    "   \n",
    "    sentence_pairs =  # TODO 1a\n",
    "\n",
    "    return zip(*sentence_pairs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "en, sp = load_and_preprocess(path_to_file, num_examples=10)\n",
    "\n",
    "print(en[-1])\n",
    "print(sp[-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `load_and_integerize`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 2\n",
    "\n",
    "Using `utils_preproc.tokenize`, implement the function `load_and_integerize` that takes as input the data path along with the number of examples we want to read in and returns the following tuple:\n",
    "\n",
    "```python\n",
    "  (input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer)\n",
    "```\n",
    "\n",
    "where \n",
    "\n",
    "\n",
    "* `input_tensor` is an integer tensor of shape `(num_examples, max_length_inp)` containing the integerized versions of the source language sentences\n",
    "* `target_tensor` is an integer tensor of shape `(num_examples, max_length_targ)` containing the integerized versions of the target language sentences\n",
    "* `inp_lang_tokenizer` is the source language tokenizer\n",
    "* `targ_lang_tokenizer` is the target language tokenizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_and_integerize(path, num_examples=None):\n",
    "\n",
    "    targ_lang, inp_lang = load_and_preprocess(path, num_examples)\n",
    "\n",
    "    # TODO 1b\n",
    "    input_tensor, inp_lang_tokenizer = # TODO\n",
    "    target_tensor, targ_lang_tokenizer = # TODO\n",
    "\n",
    "    return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train and eval splits"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll split this data 80/20 into train and validation, and we'll use only the first 30K examples, since we'll be training on a single GPU. \n",
    " \n",
    "Let us set variable for that:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "TEST_PROP = 0.2\n",
    "NUM_EXAMPLES = 30000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's load and integerize the sentence paris and store the tokenizer for the source and the target language into the `int_lang` and `targ_lang` variable respectively:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_tensor, target_tensor, inp_lang, targ_lang = load_and_integerize(\n",
    "    path_to_file, NUM_EXAMPLES)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us store the maximal sentence length of both languages into two variables:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "max_length_targ = target_tensor.shape[1]\n",
    "max_length_inp = input_tensor.shape[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are now using scikit-learn `train_test_split` to create our splits:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "splits = train_test_split(\n",
    "    input_tensor, target_tensor, test_size=TEST_PROP, random_state=SEED)\n",
    "\n",
    "input_tensor_train = splits[0]\n",
    "input_tensor_val = splits[1]\n",
    "\n",
    "target_tensor_train = splits[2]\n",
    "target_tensor_val = splits[3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's make sure the number of example in each split looks good:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(len(input_tensor_train), len(target_tensor_train),\n",
    " len(input_tensor_val), len(target_tensor_val))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `utils_preproc.int2word` function allows you to transform back the integerized sentences into words. Note that the `<start>` token is alwasy encoded as `1`, while the `<end>` token is always encoded as `0`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Input Language; int to word mapping\")\n",
    "print(input_tensor_train[0])\n",
    "print(utils_preproc.int2word(inp_lang, input_tensor_train[0]), '\\n')\n",
    "\n",
    "print(\"Target Language; int to word mapping\")\n",
    "print(target_tensor_train[0])\n",
    "print(utils_preproc.int2word(targ_lang, target_tensor_train[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create tf.data dataset for train and eval"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 3\n",
    "\n",
    "Implement the `create_dataset` function that takes as input\n",
    "* `encoder_input` which is an integer tensor of shape `(num_examples, max_length_inp)` containing the integerized versions of the source language sentences\n",
    "* `decoder_input` which is an integer tensor of shape `(num_examples, max_length_targ)`containing the integerized versions of the target language sentences\n",
    "\n",
    "It returns a `tf.data.Dataset` containing examples for the form\n",
    "\n",
    "```python\n",
    "        ((source_sentence, target_sentence), shifted_target_sentence)\n",
    "```\n",
    "\n",
    "where `source_sentence` and `target_setence` are the integer version of source-target language pairs and `shifted_target` is the same as `target_sentence` but with indices shifted by 1. \n",
    "\n",
    "**Remark:** In the training code, `source_sentence`  (resp. `target_sentence`) will be fed as the encoder (resp. decoder) input, while `shifted_target` will be used to compute the cross-entropy loss by comparing the decoder output with the shifted target sentences. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_dataset(encoder_input, decoder_input):\n",
    "    \n",
    "    # shift ahead by 1\n",
    "    target = tf.roll(decoder_input, -1, 1)\n",
    "\n",
    "    # replace last column with 0s\n",
    "    zeros = tf.zeros([target.shape[0], 1], dtype=tf.int32)\n",
    "    target = tf.concat((target[:, :-1], zeros), axis=-1)\n",
    "\n",
    "    dataset = # TODO\n",
    "\n",
    "    return dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now create the actual train and eval dataset using the function above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "BUFFER_SIZE = len(input_tensor_train)\n",
    "BATCH_SIZE = 64"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_dataset = create_dataset(\n",
    "    input_tensor_train, target_tensor_train).shuffle(\n",
    "    BUFFER_SIZE).repeat().batch(BATCH_SIZE, drop_remainder=True)\n",
    "\n",
    "\n",
    "eval_dataset = create_dataset(\n",
    "    input_tensor_val, target_tensor_val).batch(\n",
    "    BATCH_SIZE, drop_remainder=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the RNN encoder-decoder model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use an encoder-decoder architecture, however we embed our words into a latent space prior to feeding them into the RNN. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "EMBEDDING_DIM = 256\n",
    "HIDDEN_UNITS = 1024\n",
    "\n",
    "INPUT_VOCAB_SIZE = len(inp_lang.word_index) + 1\n",
    "TARGET_VOCAB_SIZE = len(targ_lang.word_index) + 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 4\n",
    "\n",
    "Implement the encoder network with Keras functional API. It will\n",
    "* start with an `Input` layer that will consume the source language integerized sentences\n",
    "* then feed them to an `Embedding` layer of `EMBEDDING_DIM` dimensions\n",
    "* which in turn will pass the embeddings to a `GRU` recurrent layer with `HIDDEN_UNITS`\n",
    "\n",
    "The output of the encoder will be the `encoder_outputs` and the `encoder_state`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "encoder_inputs = Input(shape=(None,), name=\"encoder_input\")\n",
    "\n",
    "encoder_inputs_embedded = # TODO\n",
    "\n",
    "encoder_rnn = # TODO\n",
    "\n",
    "encoder_outputs, encoder_state = encoder_rnn(encoder_inputs_embedded)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 5\n",
    "\n",
    "Implement the decoder network, which is very similar to the encoder network.\n",
    "\n",
    "It will\n",
    "* start with an `Input` layer that will consume the source language integerized sentences\n",
    "* then feed that input to an `Embedding` layer of `EMBEDDING_DIM` dimensions\n",
    "* which in turn will pass the embeddings to a `GRU` recurrent layer with `HIDDEN_UNITS`\n",
    "\n",
    "**Important:** The main difference with the encoder, is that the recurrent `GRU` layer will take as input not only the decoder input embeddings, but also the `encoder_state` as outputted by the encoder above. This is where the two networks are linked!\n",
    "\n",
    "The output of the encoder will be the `decoder_outputs` and the `decoder_state`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "decoder_inputs = Input(shape=(None,), name=\"decoder_input\")\n",
    "\n",
    "decoder_inputs_embedded = # TODO\n",
    "\n",
    "decoder_rnn = # TODO\n",
    "\n",
    "decoder_outputs, decoder_state = decoder_rnn(\n",
    "    decoder_inputs_embedded, initial_state=encoder_state)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last part of the encoder-decoder architecture is a softmax `Dense` layer that will create the next word probability vector or next word `predictions` from the `decoder_output`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "decoder_dense = Dense(TARGET_VOCAB_SIZE, activation='softmax')\n",
    "\n",
    "predictions = decoder_dense(decoder_outputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 6\n",
    "\n",
    "To be able to train the encoder-decoder network defined above, create a trainable Keras `Model` by specifying which are the `inputs` and the `outputs` of our problem. They should correspond exactly to what the type of input/output in our train and eval `tf.data.Dataset` since that's what will be fed to the `inputs` and `outputs` we declare while instantiating the Keras `Model`.\n",
    "\n",
    "While compiling our model, we should make sure that the loss is the `sparse_categorical_crossentropy` so that we can compare the true word indices for the target language as outputted by our train `tf.data.Dataset` with the next word `predictions` vector as outputted by the decoder:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = # TODO\n",
    "\n",
    "model.compile(# TODO)\n",
    "model.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now train the model!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "STEPS_PER_EPOCH = len(input_tensor_train)//BATCH_SIZE\n",
    "EPOCHS = 1\n",
    "\n",
    "\n",
    "history = model.fit(\n",
    "    train_dataset,\n",
    "    steps_per_epoch=STEPS_PER_EPOCH,\n",
    "    validation_data=eval_dataset,\n",
    "    epochs=EPOCHS\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Implementing the translation (or decoding) function\n",
    "\n",
    "We can't just use model.predict(), because we don't know all the inputs we used during training. We only know the encoder_input (source language) but not the decoder_input (target language), which is what we want to predict (i.e., the translation of the source language)!\n",
    "\n",
    "We do however know the first token of the decoder input, which is the `<start>` token. So using this plus the state of the encoder RNN, we can predict the next token. We will then use that token to be the second token of decoder input, and continue like this until we predict the `<end>` token, or we reach some defined max length.\n",
    "\n",
    "So, the strategy now is to split our trained network into two independent Keras models:\n",
    "\n",
    "* an **encoder model** with signature `encoder_inputs -> encoder_state`\n",
    "* a **decoder model** with signature `[decoder_inputs, decoder_state_input] -> [predictions, decoder_state]`\n",
    "\n",
    "This way, we will be able to encode the source language sentence into the vector `encoder_state` using the encoder and feed it to the decoder model along with the `<start>` token at step 1. \n",
    "\n",
    "Given that input, the decoder will produce the first word of the translation, by sampling from the `predictions` vector (for simplicity, our sampling strategy here will be to take the next word to be the one whose index has the maximum probability in the `predictions` vector) along with a new state vector, the `decoder_state`. \n",
    "\n",
    "At this point, we can feed again to the decoder the predicted first word and as well as the new `decoder_state` to predict the translation second word. \n",
    "\n",
    "This process can be continued until the decoder produces the token `<stop>`. \n",
    "\n",
    "This is how we will implement our translation (or decoding) function, but let us first extract a separate encoder and a separate decoder from our trained encoder-decoder model. \n",
    "\n",
    "\n",
    "**Remark:** If we have already trained and saved the models (i.e, `LOAD_CHECKPOINT` is `True`) we will just load the models, otherwise, we extract them from the trained network above by explicitly creating the encoder and decoder Keras `Model`s with the signature we want.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 7\n",
    "\n",
    "Create the Keras Model `encoder_model` with signature `encoder_inputs -> encoder_state` and the Keras Model `decoder_model` with signature `[decoder_inputs, decoder_state_input] -> [predictions, decoder_state]`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if LOAD_CHECKPOINT:\n",
    "    encoder_model = load_model(os.path.join(MODEL_PATH, 'encoder_model.h5'))\n",
    "    decoder_model = load_model(os.path.join(MODEL_PATH, 'decoder_model.h5'))\n",
    "\n",
    "else:\n",
    "    encoder_model = # TODO\n",
    "    \n",
    "    decoder_state_input = Input(shape=(HIDDEN_UNITS,), name=\"decoder_state_input\")\n",
    "\n",
    "    # Reuses weights from the decoder_rnn layer\n",
    "    decoder_outputs, decoder_state = decoder_rnn(\n",
    "        decoder_inputs_embedded, initial_state=decoder_state_input)\n",
    "\n",
    "    # Reuses weights from the decoder_dense layer\n",
    "    predictions = decoder_dense(decoder_outputs)\n",
    "\n",
    "    decoder_model = # TODO"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 8\n",
    "\n",
    "Now that we have a separate encoder and a separate decoder, implement a translation function, to which we will give the generic name of `decode_sequences` (to stress that this procedure is general to all seq2seq problems). \n",
    "\n",
    "`decode_sequences` will take as input\n",
    "* `input_seqs` which is the integerized source language sentence tensor that the encoder can consume\n",
    "* `output_tokenizer` which is the target languague tokenizer we will need to extract back words from predicted word integers\n",
    "* `max_decode_length` which is the length after which we stop decoding if the `<stop>` token has not been predicted\n",
    "\n",
    "\n",
    "**Note**: Now that the encoder and decoder have been turned into Keras models, to feed them their input, we need to use the `.predict` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def decode_sequences(input_seqs, output_tokenizer, max_decode_length=50):\n",
    "    \"\"\"\n",
    "    Arguments:\n",
    "    input_seqs: int tensor of shape (BATCH_SIZE, SEQ_LEN)\n",
    "    output_tokenizer: Tokenizer used to conver from int to words\n",
    "\n",
    "    Returns translated sentences\n",
    "    \"\"\"\n",
    "    # Encode the input as state vectors.\n",
    "    states_value = encoder_model.predict(input_seqs)\n",
    "\n",
    "    # Populate the first character of target sequence with the start character.\n",
    "    batch_size = input_seqs.shape[0]\n",
    "    target_seq = tf.ones([batch_size, 1])\n",
    "\n",
    "    decoded_sentences = [[] for _ in range(batch_size)]\n",
    "\n",
    "    for i in range(max_decode_length):\n",
    "\n",
    "        output_tokens, decoder_state = decoder_model.predict(\n",
    "            [target_seq, states_value])\n",
    "\n",
    "        # Sample a token\n",
    "        sampled_token_index = # TODO\n",
    "\n",
    "        tokens = # TODO\n",
    "\n",
    "        for j in range(batch_size):\n",
    "            decoded_sentences[j].append(tokens[j])\n",
    "\n",
    "        # Update the target sequence (of length 1).\n",
    "        target_seq = tf.expand_dims(tf.constant(sampled_token_index), axis=-1)\n",
    "\n",
    "        # Update states\n",
    "        states_value = decoder_state\n",
    "\n",
    "    return decoded_sentences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we're ready to predict!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sentences = [\n",
    "    \"No estamos comiendo.\",\n",
    "    \"Está llegando el invierno.\",\n",
    "    \"El invierno se acerca.\",\n",
    "    \"Tom no comio nada.\",\n",
    "    \"Su pierna mala le impidió ganar la carrera.\",\n",
    "    \"Su respuesta es erronea.\",\n",
    "    \"¿Qué tal si damos un paseo después del almuerzo?\"\n",
    "]\n",
    "\n",
    "reference_translations = [\n",
    "    \"We're not eating.\",\n",
    "    \"Winter is coming.\",\n",
    "    \"Winter is coming.\",\n",
    "    \"Tom ate nothing.\",\n",
    "    \"His bad leg prevented him from winning the race.\",\n",
    "    \"Your answer is wrong.\",\n",
    "    \"How about going for a walk after lunch?\"\n",
    "]\n",
    "\n",
    "machine_translations = decode_sequences(\n",
    "    utils_preproc.preprocess(sentences, inp_lang),\n",
    "    targ_lang,\n",
    "    max_length_targ\n",
    ")\n",
    "\n",
    "for i in range(len(sentences)):\n",
    "    print('-')\n",
    "    print('INPUT:')\n",
    "    print(sentences[i])\n",
    "    print('REFERENCE TRANSLATION:')\n",
    "    print(reference_translations[i])\n",
    "    print('MACHINE TRANSLATION:')\n",
    "    print(machine_translations[i])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Checkpoint Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 9\n",
    "\n",
    "Save\n",
    "* `model` to disk as the file `model.h5`\n",
    "* `encoder_model` to disk as the file `encoder_model.h5`\n",
    "* `decoder_model` to disk as the file `decoder_model.h5`\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if not LOAD_CHECKPOINT:\n",
    "\n",
    "    os.makedirs(MODEL_PATH, exist_ok=True)\n",
    "\n",
    "    # TODO\n",
    "\n",
    "    with open(os.path.join(MODEL_PATH, 'encoder_tokenizer.pkl'), 'wb') as fp:\n",
    "        pickle.dump(inp_lang, fp)\n",
    "\n",
    "    with open(os.path.join(MODEL_PATH, 'decoder_tokenizer.pkl'), 'wb') as fp:\n",
    "        pickle.dump(targ_lang, fp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation Metric (BLEU)\n",
    "\n",
    "Unlike say, image classification, there is no one right answer for a machine translation. However our current loss metric, cross entropy, only gives credit when the machine translation matches the exact same word in the same order as the reference translation. \n",
    "\n",
    "Many attempts have been made to develop a better metric for natural language evaluation. The most popular currently is Bilingual Evaluation Understudy (BLEU).\n",
    "\n",
    "- It is quick and inexpensive to calculate.\n",
    "- It allows flexibility for the ordering of words and phrases.\n",
    "- It is easy to understand.\n",
    "- It is language independent.\n",
    "- It correlates highly with human evaluation.\n",
    "- It has been widely adopted.\n",
    "\n",
    "The score is from 0 to 1, where 1 is an exact match.\n",
    "\n",
    "It works by counting matching n-grams between the machine and reference texts, regardless of order. BLUE-4 counts matching n grams from 1-4 (1-gram, 2-gram, 3-gram and 4-gram). It is common to report both BLUE-1 and BLUE-4\n",
    "\n",
    "It still is imperfect, since it gives no credit to synonyms and so human evaluation is still best when feasible. However BLEU is commonly considered the best among bad options for an automated metric.\n",
    "\n",
    "The NLTK framework has an implementation that we will use.\n",
    "\n",
    "We can't run calculate BLEU during training, because at that time the correct decoder input is used. Instead we'll calculate it now.\n",
    "\n",
    "For more info: https://machinelearningmastery.com/calculate-bleu-score-for-text-python/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def bleu_1(reference, candidate):\n",
    "    reference = list(filter(lambda x: x != '', reference))  # remove padding\n",
    "    candidate = list(filter(lambda x: x != '', candidate))  # remove padding\n",
    "    smoothing_function = nltk.translate.bleu_score.SmoothingFunction().method1\n",
    "    return nltk.translate.bleu_score.sentence_bleu(\n",
    "        reference, candidate, (1,), smoothing_function)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def bleu_4(reference, candidate):\n",
    "    reference = list(filter(lambda x: x != '', reference))  # remove padding\n",
    "    candidate = list(filter(lambda x: x != '', candidate))   # remove padding\n",
    "    smoothing_function = nltk.translate.bleu_score.SmoothingFunction().method1\n",
    "    return nltk.translate.bleu_score.sentence_bleu(\n",
    "        reference, candidate, (.25, .25, .25, .25), smoothing_function)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 10\n",
    "\n",
    "Let's now average the `bleu_1` and `bleu_4` scores for all the sentence pairs in the eval set. The next cell takes around **15 minutes** time to run, the bulk of which is decoding the 1000 sentences in the validation set. Please wait unitl completes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "num_examples = len(input_tensor_val)\n",
    "bleu_1_total = 0\n",
    "bleu_4_total = 0\n",
    "max_sentences = 1000\n",
    "\n",
    "\n",
    "for idx in range(min(num_examples, max_sentences)):\n",
    "    reference_sentence = utils_preproc.int2word(\n",
    "        targ_lang, target_tensor_val[idx][1:])\n",
    "\n",
    "    decoded_sentence = decode_sequences(\n",
    "        input_tensor_val[idx:idx+1], targ_lang, max_length_targ)[0]\n",
    "\n",
    "    bleu_1_total += # TODO\n",
    "    bleu_4_total += # TODO\n",
    "\n",
    "    if idx + 1 >= max_sentences:\n",
    "        break\n",
    "\n",
    "print('BLEU 1: {}'.format(bleu_1_total/num_examples))\n",
    "print('BLEU 4: {}'.format(bleu_4_total/num_examples))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Results\n",
    "\n",
    "**Hyperparameters**\n",
    "\n",
    "- Batch_Size: 64\n",
    "- Optimizer: adam\n",
    "- Embed_dim: 256\n",
    "- GRU Units: 1024\n",
    "- Train Examples: 24,000\n",
    "- Epochs: 10\n",
    "- Hardware: P100 GPU\n",
    "\n",
    "**Performance**\n",
    "- Training Time: 5min \n",
    "- Cross-entropy loss: train: 0.0722 - val: 0.9062\n",
    "- BLEU 1: 0.2519574312515255\n",
    "- BLEU 4: 0.04589972764144636"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### References\n",
    "\n",
    "- Francois Chollet: https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py\n"
   ]
  }
 ],
 "metadata": {
  "environment": {
   "name": "tf2-gpu.2-1.m68",
   "type": "gcloud",
   "uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-1:m68"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
